{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f5ce62c5",
   "metadata": {},
   "source": [
    "\n",
    "# Predictive Biostatistics: Regression and Logistic Models\n",
    "\n",
    "This notebook explores predictive models for clinical variables, including linear and logistic regression. Links to the dataset are provided for easy access.\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a401356b",
   "metadata": {},
   "source": [
    "\n",
    "## Dataset Information and Download Links\n",
    "\n",
    "The analysis in this notebook uses a **Diabetes dataset**, which can be downloaded from the following source:\n",
    "\n",
    "1. **Kaggle:**\n",
    "   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)\n",
    "   - This dataset includes detailed patient data for diabetes prediction and analysis.\n",
    "\n",
    "### Dataset Attributes\n",
    "\n",
    "- **Pregnancies**: Number of pregnancies.\n",
    "- **Glucose**: Plasma glucose concentration.\n",
    "- **BloodPressure**: Diastolic blood pressure (mm Hg).\n",
    "- **SkinThickness**: Triceps skinfold thickness (mm).\n",
    "- **Insulin**: 2-Hour serum insulin (mu U/ml).\n",
    "- **BMI**: Body mass index (weight in kg/(height in m)^2).\n",
    "- **DiabetesPedigreeFunction**: Diabetes pedigree function.\n",
    "- **Age**: Age of the patient.\n",
    "- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).\n",
    "\n",
    "### Usage Notes\n",
    "\n",
    "- Preprocess the dataset as needed (e.g., handle missing values).\n",
    "- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for detailed information.\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e80e203",
   "metadata": {},
   "source": [
    "\n",
    "## Univariate Linear Regression: HbA1c ~ BMI\n",
    "\n",
    "A simple linear regression model is created to predict HbA1c levels (proxy: glucose) based on BMI.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33bccbc3",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import pandas as pd\n",
    "import statsmodels.formula.api as smf\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Load the dataset (replace with your file path)\n",
    "data = pd.read_csv(r'C:\\Path\\to\\diabetes.csv')\n",
    "\n",
    "# Univariate Linear Regression\n",
    "model1 = smf.ols(formula='Glucose ~ BMI', data=data).fit()\n",
    "print(model1.summary())\n",
    "\n",
    "# Scatter plot with regression line\n",
    "sns.regplot(x='BMI', y='Glucose', data=data, ci=95)\n",
    "plt.title(\"Linear Regression: Glucose vs BMI\")\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0aedf91d",
   "metadata": {},
   "source": [
    "\n",
    "## Logistic Regression: Predicting Diabetes Using Glucose Levels\n",
    "\n",
    "A logistic regression model is created to predict diabetes status based on glucose levels.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "187b0d92",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import statsmodels.api as sm\n",
    "\n",
    "# Logistic Regression\n",
    "data['Intercept'] = 1\n",
    "logit_model = sm.Logit(data['Outcome'], data[['Glucose', 'Intercept']]).fit()\n",
    "\n",
    "# Print logistic regression summary\n",
    "print(logit_model.summary())\n",
    "\n",
    "# Plot probabilities vs glucose levels\n",
    "fitted_probabilities = logit_model.predict(data[['Glucose', 'Intercept']])\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.scatterplot(x=data['Glucose'], y=fitted_probabilities)\n",
    "plt.title(\"Logistic Regression: Predicted Probabilities vs Glucose Levels\")\n",
    "plt.xlabel(\"Glucose Level\")\n",
    "plt.ylabel(\"Predicted Probability\")\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f631280f",
   "metadata": {},
   "source": [
    "\n",
    "## Multivariate Linear Regression: HbA1c ~ BMI + Age + Diabetes Pedigree\n",
    "\n",
    "A multivariate regression model is created to predict HbA1c levels (proxy: glucose) based on BMI, age, and diabetes pedigree.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57b4628a",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Multivariate Linear Regression\n",
    "model2 = smf.ols(formula='Glucose ~ BMI + Age + DiabetesPedigreeFunction', data=data).fit()\n",
    "print(model2.summary())\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75cec7a6",
   "metadata": {},
   "source": [
    "\n",
    "## Multivariate Logistic Regression: Diabetes Prediction Using Multiple Variables\n",
    "\n",
    "A multivariate logistic regression model predicts diabetes using glucose, BMI, age, and blood pressure.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cccce31b",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Multivariate Logistic Regression\n",
    "predictors = ['Glucose', 'BMI', 'Age', 'BloodPressure', 'Intercept']\n",
    "logit_model_multivariate = sm.Logit(data['Outcome'], data[predictors]).fit()\n",
    "\n",
    "# Print logistic regression summary\n",
    "print(logit_model_multivariate.summary())\n",
    "        "
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
